look deeper.




EXL Decision Analytics Methodology Snapshot 



We apply a set of highly effective tools, techniques and best practices for the end-to-end model development cycle 


Stage 1: Preliminary Data Exploration
■ Univariate Analysis (EDD*)
■ Modeling and Validation Split
■ Bivariate Analysis

Stage 2: Data Preparation
■ Outlier Treatment
■ Missing Imputation
■ Roll Ups and Data Merge

Stages 1 and 2 demand a lot of manual effort in analyzing and understanding each and every variable

Stage 3: Variable Creation
■ Dummy Variable Creation
■ Binning and Banding
■ Transformations
■ Interactions and Groupings

Stage 4: Variable Reduction
■ Variable Clustering
■ Inter-Correlation Analysis
■ Variance Inflation Factor Test

Stages 3 and 4 require business sense and out-of-the-box thinking for brainstorming on creating hypothesis-based variables and dropping redundant features

Stage 5: Modeling
■ Modeling Technique Selection
■ Model Improvements
■ Ensemble

Stage 6: Validation and Stabilization
■ In-Sample Validation
■ Out-of-Time Validation
■ Bootstrapping
■ Coefficient Blasting

Stages 5 and 6 require good knowledge of statistical techniques for providing high-quality solutions

* Extended Data Dictionary









































Objectives and Scope 



Course Goals 

To provide a structured overview of model validation and stabilization techniques used during application 
of EXL DA methodology 

To introduce trainees to several model performance and stability measures 
To explain metric calculations through illustrations 

Hands on exercises on real life data to practice calculation of validation metrics during the training course 
To provide helpful “tricks of the trade” 


Beyond the Scope of this Training 

Comprehensive coaching on model validation 

Derivation of statistical formulas or terms (unless required as part of methodology explanation) 


Self Study Goals 

Model validation practice on hypothetical data 
In-depth research on advanced concepts relating to validation and stabilization
Discussion on advanced concepts can be taken up offline 
Chapter 1: Basics of Model Validation


1.1 Model Validation



1.1.1. Need for Validation 


What is Model Validation? 

Model validation is the process of determining the degree to which a model generated by statistical software
(based on input data) is an accurate representation of the real world



Why is Validation Needed?

■ Generalization

To ascertain whether predicted values from the model are likely to accurately predict responses on future
subjects or subjects not used to develop the model

■ Stability Check

To test how consistently the model is going to perform over time

■ Robustness Check

To test whether the model is an appropriate representation of the real world for the stated purpose and
whether the model is acceptable for its intended use


A model without sufficient validation is only a hypothesis. 
1.1.2. Types of Validation


Type                       Description                                                Technique
Apparent                   Performance on sample used to develop model                Apparent
Internal (Out-of-Sample)   Performance on population underlying the sample            Split Sample
                                                                                      Cross Validation
                                                                                      Bootstrapping
External                   Performance on related but slightly different population   Out-of-time (OOT)
                                                                                      Spatial Validation
                                                                                      Fully External Validation

Validity Strength (drawn as bars of growing length on the slide) increases down the list of techniques, from Apparent, the weakest, to Fully External Validation, the strongest


Apparent Validation
■ Measures model performance on modeling data itself; there is no significant value add
■ Provides optimistic estimates of model performance
■ Very easy to implement
■ Validity strength is very low; implementation of such a model in the real world may show disappointing results

Internal Validation
■ Data for model development and evaluation are both random samples from the same underlying population
■ Provides honest and reasonable estimates of model performance
■ Sets an upper limit to what may be expected in external validation
■ Slightly difficult to implement; all model variables need to be created in the validation set

External Validation
■ Once the model is developed, it is validated in other settings
■ Very strong test of model performance
■ Difficult to implement
■ Appropriate population eligibility conditions to be applied for validation population
■ All model variables need to be created in the validation set

















Illustration: To predict the probability that a college student pays fees on time

Type: Apparent
  Technique: Apparent
    Modeling Data: Year 2011 batch students of College XYZ
    Validation Data: Same as modeling data

Type: Internal (Out-of-Sample)
  Technique: Split Sample
    Modeling Data: X% (e.g. 80%) random sample of year 2011 batch students of College XYZ
    Validation Data: Remaining (i.e. 20% of) year 2011 batch students of College XYZ
  Technique: Cross Validation (k-fold)
    1. Divide data into k equal-sized random samples. For example, k = 5
    2. Use 4 samples (i.e. 80% data) for modeling and 1 sample (i.e. 20%) for validation
    3. Repeat Step 2 five times so that all 5 samples are used for validation once
    4. Take average of validation metric across 5 samples
  Technique: Bootstrapping
    1. Keep aside a holdout sample for validation
    2. Draw 80% random sample (with replacement) for modeling
    3. Repeat Step 2 a large number of times (m). For example, m = 1000 times
    4. Keep those variables in final model whose %occurrence in m models > fixed cut-off (say, 85%)

Type: External
  Modeling Data (all three techniques): Year 2011 batch students of College XYZ
  Technique: Out-of-time (OOT)
    Validation Data: Same population in different time period, i.e. year 2012 batch students of College XYZ
  Technique: Spatial Validation
    Validation Data: Different population in same time period, i.e. year 2011 batch students of College ABC
  Technique: Fully External Validation
    Validation Data: Different population in different time period, i.e. year 2012 batch students of College ABC
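
The k-fold cross validation steps in the illustration can be sketched in a few lines. This is a minimal Python sketch rather than SAS, and the names `k_fold_indices`, `cross_validate` and `score_fn` are hypothetical helpers introduced only for illustration; `score_fn` stands in for whatever model fitting and metric calculation is actually used.

```python
import random

def k_fold_indices(n_rows, k=5, seed=42):
    """Split row indices 0..n_rows-1 into k random folds.

    Returns a list of (train_idx, valid_idx) pairs in which every row
    is used for validation exactly once.
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # random assignment of rows to folds
    folds = [idx[i::k] for i in range(k)]     # k near-equal-sized samples
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, valid))
    return splits

def cross_validate(n_rows, score_fn, k=5, seed=42):
    """Average a validation metric across the k folds.

    score_fn(train_idx, valid_idx) is an assumed callback that builds a
    model on the training rows and returns a metric on the validation rows.
    """
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_rows, k, seed)]
    return sum(scores) / len(scores)

# Usage: 100 rows, 5 folds, i.e. 80 rows for modeling and 20 for validation each time
splits = k_fold_indices(100, k=5)
share_validated = cross_validate(100, lambda tr, va: len(va) / (len(tr) + len(va)))
```

The dummy `score_fn` in the usage line only returns the validation share of each fold; in practice it would fit the model and return a metric such as AUC.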
1.1.3. EXL’s Standard Approach

1. Extract data from the client data system and prepare it (master data)
2. Split data randomly into modeling (80%) and validation (20%) sets
3. Build multiple models with varied lists of predictors
4. Shortlist top X models based on performance on validation set
5. Create a union of the lists of all predictors in top X models
6. Draw 70% random samples (with replacement) 1000 times and use the variable list of Step 5 for building 1000 models (Model 1, Model 2, Model 3, ..., Model 1000) on the 1000 random samples
7. Identify variables with %occurrence > a fixed cut-off (e.g. 85%)
8. Build final model and measure performance on modeling set
9. Validate final model on validation set
10. Request Out-of-Time sample, if available, and validate final model on OOT dataset

The flowchart groups these steps under four headings: Apparent Validation, Split Sample (Internal Validation), Bootstrapping (Internal Validation) and External Validation

Numbers mentioned in the flowchart are general rules of thumb

■ At Step 2, split may be 80:20, 70:30 or even 50:50

■ At Step 6, repetition may be 100 times, 500 times or 1000 times

■ At Step 6, %random sample may be 80%, 70% or even 50%
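
The bootstrap variable screen (keep variables whose %occurrence across the bootstrap models exceeds a cut-off) reduces to a frequency count once each model's selected variables are known. The Python sketch below assumes those lists are already available; `stable_variables` and the variable names are hypothetical and only illustrate the counting rule, not the model building itself.

```python
from collections import Counter

def stable_variables(selected_per_model, cutoff=0.85):
    """Return variables whose share of occurrence across models exceeds cutoff.

    selected_per_model: one list of selected variable names per bootstrap model.
    """
    n_models = len(selected_per_model)
    # set() so a variable is counted at most once per model
    counts = Counter(v for selected in selected_per_model for v in set(selected))
    return sorted(v for v, c in counts.items() if c / n_models > cutoff)

# Usage with four hypothetical bootstrap models
selections = [
    ["age", "income"],
    ["age"],
    ["age", "tenure"],
    ["age", "income"],
]
kept_strict = stable_variables(selections, cutoff=0.85)   # only "age" (100% occurrence)
kept_loose = stable_variables(selections, cutoff=0.40)    # "age" and "income" (50%)
```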
1.2 Bias and Variance



1.2.1 Error Decomposition


Consider a model where the error ($\varepsilon$) is normally distributed with zero mean and a constant variance:

$$Y = f(X) + \varepsilon \quad \text{such that} \quad E(\varepsilon) = 0 \ \text{and} \ \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$$

Let $f(X)$ be estimated by the model $\hat{f}(X)$

Expected squared prediction error at a point $x_0$ is given by:

$$
\begin{aligned}
\mathrm{Err}(x_0) &= E\big[(Y - \hat{f}(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + \big[E(\hat{f}(x_0)) - f(x_0)\big]^2 + E\big[(\hat{f}(x_0) - E(\hat{f}(x_0)))^2\big] \\
&= \sigma_\varepsilon^2 + \big[\mathrm{Bias}(\hat{f}(x_0))\big]^2 + \mathrm{Var}(\hat{f}(x_0)) \\
&= \text{Noise} + \text{Bias}^2 + \text{Variance}
\end{aligned}
$$

■ Noise: irreducible error on target Y
■ Bias²: deviation of the average estimate from the true function’s mean
■ Variance: expected squared deviation of model’s estimate around its mean

Things to Remember

■ Bias is a measure of avg. prediction error across samples

■ Variance reflects how much prediction varies from one sample to another
1.2.2. Bias and Variance Trade-Off


If a model is too simple, the model would

■ Be unable to fit the true structure

■ Have a lot of bias (error between the true function and model’s approximation)


If a model is too complex, the model would

■ Overfit to the noise in training sample

■ Become very sensitive to the particular training sample used

■ Have a lot of variance across training samples


[Figure: Prediction Error (Low to High) against Model Complexity (Low to High), with a curve labelled Training Sample. The low-complexity side is marked High Bias / Low Variance, the high-complexity side Low Bias / High Variance, and the Optimal Model Complexity is marked on the complexity axis]


Things to Remember

■ Training error is typically lower than test error

■ Training error can be reduced by increasing model complexity, but this risks overfitting

■ It is recommended to minimize the test error to obtain optimal level of model complexity
1.3 Components of Validation



1.3.1. Sampling Strategies

Sampling strategies are aimed at addressing the uncertainty that can arise in tests using empirical data
■ Examples: Cross Validation, Bootstrapping, Out-of-Sample and Out-of-Time Validation

1.3.2. Power-Testing

Power-testing techniques are aimed at measuring model’s goodness-of-fit

■ Examples

Classification Table, K-S Statistic, AUC and Concordance for a classification model
R² for a regression model

1.3.3. Calibration


Calibration techniques are aimed at assessing how closely the model’s predictions match the actual
(i.e. observed) values

■ Examples

Hosmer-Lemeshow test for a classification model

Primary and Secondary Diagonal Metric, ME, MSE, RMSE, MAE, MPE and MAPE for a regression model


While sampling strategies are meant for model stabilization, power testing and calibration measure model performance
Chapter 2: Validation Methods


2.1 Classification Model Performance Measures



2.1.1 Classification Table (Confusion Matrix)


Classification table

■ 2x2 matrix of actual and predicted classes

■ Also known as Confusion Matrix or Contingency Table

■ Greater the sum of primary diagonal (TP + TN), higher the degree of classification accuracy


                          Predicted Class = 1           Predicted Class = 0               Total
Target = 1 (Event)        TP (True Positive)            FN (False Negative)               TP + FN = E (#Events)
Target = 0 (Non-Event)    FP (False Positive)           TN (True Negative)                FP + TN = NE (#Non-Events)
Total                     TP + FP                       FN + TN                           N = TP + TN + FP + FN
                          (#Cases predicted as Event)   (#Cases predicted as Non-Event)   (Total #cases)


Reading the cell names

■ TP: Positive, because predicted class = 1; True, because prediction is correct

■ FN: Negative, because predicted class = 0; False, because prediction is wrong

■ FP: Positive, because predicted class = 1; False, because prediction is wrong

■ TN: Negative, because predicted class = 0; True, because prediction is correct
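
To make the four cells concrete, the following Python sketch counts TP, FN, FP and TN at a given probability cut-off and derives accuracy as (TP + TN) / N. The function names are hypothetical; the eight observations are the ones used in the concordance illustration of Section 2.1.2, and the 0.50 cut-off is an arbitrary choice for the example.

```python
def confusion_counts(targets, probabilities, cutoff):
    """Count (TP, FN, FP, TN); an observation is predicted as event if p > cutoff."""
    tp = fn = fp = tn = 0
    for actual, p in zip(targets, probabilities):
        predicted = 1 if p > cutoff else 0
        if actual == 1 and predicted == 1:
            tp += 1          # event correctly predicted as event
        elif actual == 1:
            fn += 1          # event wrongly predicted as non-event
        elif predicted == 1:
            fp += 1          # non-event wrongly predicted as event
        else:
            tn += 1          # non-event correctly predicted as non-event
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Share of correctly classified cases: primary diagonal over total."""
    return (tp + tn) / (tp + tn + fp + fn)

targets = [0, 0, 0, 0, 0, 1, 1, 1]
probabilities = [0.36, 0.87, 0.42, 0.13, 0.10, 0.40, 0.87, 0.83]
tp, fn, fp, tn = confusion_counts(targets, probabilities, cutoff=0.50)
acc = accuracy(tp, fn, fp, tn)   # 6 of the 8 cases fall on the primary diagonal
```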
SAS Implementation


Below is the syntax for generating classification table


PROC LOGISTIC DATA = <modeling dataset>     /* Specify name of modeling dataset for regression */
              NAMELEN = 32                  /* This option does not let variable name length get truncated to 20 */
              DESCENDING ;                  /* This option reverses the sorting order for the levels of dependent variable */
    MODEL <dependent> = <regressors>
          / SELECTION = <selection method>
            SLE = <SLE criterion>
            SLS = <SLS criterion>
            CTABLE ;                        /* This option requests the classification table */
RUN ;


Classification table (generated by CTABLE option) provides true positives, false positives, true negatives and
false negatives at varied levels of probability z

An observation is predicted as event if the predicted event probability exceeds z


Classification table generated as a part of SAS output can be used to identify the probability cut-off point for classification decision
Illustrative SAS Output


$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Prob.    Correct   Correct     Incorrect   Incorrect    Percentage
Level    Event     Non-Event   Event       Non-Event    Correct
         (TP)      (TN)        (FP)        (FN)
0.02     15        0           135         0            10.0
0.04     14        50          85          1            42.7
0.06     13        75          60          2            58.7
0.08     12        82          53          3            62.7
0.10     11        85          50          4            64.0
0.12     10        85          50          5            63.3
0.14     10        90          45          5            66.7
0.16     10        95          40          5            70.0
0.18     9         98          37          6            71.3
0.20     8         100         35          7            72.0
0.22     7         110         25          8            78.0
0.24     7         119         16          8            84.0
0.26     7         120         15          8            84.7
0.28     6         125         10          9            87.3
0.30     6         129         6           9            90.0
0.32     2         129         6           13           87.3
0.34     2         129         6           13           87.3
0.36     2         130         5           13           88.0
0.38     2         130         5           13           88.0
0.40     1         130         5           14           87.3
0.42     1         130         5           14           87.3
0.44     1         130         5           14           87.3
0.46     1         130         5           14           87.3
0.48     1         130         5           14           87.3
0.50     1         130         5           14           87.3
0.52     1         130         5           14           87.3
0.54     1         130         5           14           87.3
0.56     0         130         5           15           86.7
0.58     0         130         5           15           86.7
0.60     0         130         5           15           86.7

[Figure: Accuracy plotted against Probability Level]

0.30 may be used as the cut-off probability level for assigning classes, since accuracy peaks there

If probability > 0.30, predicted class = 1
If probability ≤ 0.30, predicted class = 0

Such classification yields 90% accuracy
2.1.2. Concordance and Discordance


Concordant

A pair of an event and a non-event is said to be a concordant pair if the event observation has higher
predicted event probability than the non-event observation


Discordant

A pair of an event and a non-event is said to be a discordant pair if the event observation has lower
predicted event probability than the non-event observation


Tied

A pair of an event and a non-event is said to be a tied pair if the predicted event probability for both the
event and the non-event observations is exactly the same

Example:

TARGET   PREDICTION
0        0.90
1        0.90
Illustration


Given Data

ID   TARGET   PREDICTION
1    0        0.36
2    0        0.87
3    0        0.42
4    0        0.13
5    0        0.10
6    1        0.40
7    1        0.87
8    1        0.83

Number of Events : 3
Number of Non-Events : 5

Number of Distinct Pairs of Events and Non-Events = #Events x #Non-Events = 3 x 5 = 15


PAIR   Non-Event (ID, PREDICTION)   Event (ID, PREDICTION)   RESULT
1      1, 0.36                      6, 0.40                  Concordant
2      1, 0.36                      7, 0.87                  Concordant
3      1, 0.36                      8, 0.83                  Concordant
4      2, 0.87                      6, 0.40                  Discordant
5      2, 0.87                      7, 0.87                  Tied
6      2, 0.87                      8, 0.83                  Discordant
7      3, 0.42                      6, 0.40                  Discordant
8      3, 0.42                      7, 0.87                  Concordant
9      3, 0.42                      8, 0.83                  Concordant
10     4, 0.13                      6, 0.40                  Concordant
11     4, 0.13                      7, 0.87                  Concordant
12     4, 0.13                      8, 0.83                  Concordant
13     5, 0.10                      6, 0.40                  Concordant
14     5, 0.10                      7, 0.87                  Concordant
15     5, 0.10                      8, 0.83                  Concordant


#Pairs = 15
#Concordant Pairs = 11
#Discordant Pairs = 3
#Tied Pairs = 1

Percent Concordance = 11/15 = 73.3
Percent Discordance = 3/15 = 20.0
Percent Tied = 1/15 = 6.7
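
The pair-by-pair classification in this illustration can be reproduced with a short brute-force loop over every event and non-event pair. This is a Python sketch for checking the counts by hand, not the algorithm SAS uses internally; `classify_pairs` is a hypothetical name.

```python
def classify_pairs(targets, predictions):
    """Count concordant, discordant and tied (event, non-event) pairs."""
    events = [p for y, p in zip(targets, predictions) if y == 1]
    non_events = [p for y, p in zip(targets, predictions) if y == 0]
    concordant = discordant = tied = 0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1      # event scored higher than non-event
            elif p_event < p_non_event:
                discordant += 1      # event scored lower than non-event
            else:
                tied += 1            # identical predicted probabilities
    return concordant, discordant, tied

# Given data from the illustration (IDs 1 to 8)
targets = [0, 0, 0, 0, 0, 1, 1, 1]
predictions = [0.36, 0.87, 0.42, 0.13, 0.10, 0.40, 0.87, 0.83]

n_c, n_d, n_t = classify_pairs(targets, predictions)
n_pairs = n_c + n_d + n_t
pct_concordant = round(100 * n_c / n_pairs, 1)
pct_discordant = round(100 * n_d / n_pairs, 1)
pct_tied = round(100 * n_t / n_pairs, 1)
```

The double loop costs #Events x #Non-Events comparisons, which is fine for a worked example but slow on large datasets.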
SAS Implementation


Below is the syntax for computing concordance and discordance metrics


PROC LOGISTIC DATA = <modeling dataset>     /* Specify name of modeling dataset for regression */
              NAMELEN = 32                  /* This option does not let variable name length get truncated to 20 */
              DESCENDING ;                  /* This option reverses the sorting order for the levels of dependent variable */
    MODEL <dependent> = <regressors>
          / SELECTION = <selection method>
            SLE = <SLE criterion>
            SLS = <SLS criterion> ;
    ODS OUTPUT ASSOCIATION = <output data> ;
RUN ;


■ In addition to percent concordance, percent discordance and percent tied, the association table reports
four more metrics:

Somers’ D
Goodman-Kruskal Gamma
Kendall’s Tau-a
c
Illustrative SAS Output

LST File (concordance_calculation.lst)

Association of Predicted Probabilities and Observed Responses

Percent Concordant    73.3    Somers' D    0.533
Percent Discordant    20      Gamma        0.571
Percent Tied          6.7     Tau-a        0.286
Pairs                 15      c            0.767


SAS Dataset (association.sas7bdat)

Label1                cValue1    nValue1      Label2       cValue2    nValue2
Percent Concordant    73.3       73.333333    Somers' D    0.533      0.533333
Percent Discordant    20         20           Gamma        0.571      0.571429
Percent Tied          6.7        6.666667     Tau-a        0.286      0.285714
Pairs                 15         15           c            0.767      0.766667


Guidelines / Thumb Rules

Percent Concordance    Interpretation
<70                    Poor Discrimination
70-80                  Acceptable Discrimination
80-90                  Good Discrimination
>90                    Excellent Discrimination


Higher percent concordance indicates better good-bad discrimination power
$$\text{Somers' } D = \frac{n_c - n_D}{n_p}$$

$$\text{Gamma} = \frac{n_c - n_D}{n_c + n_D}$$

$$\text{Tau-a} = \frac{n_c - n_D}{0.5\,N(N - 1)}$$

$$c = \frac{n_c + 0.5\,n_T}{n_p} \quad \text{(Area under the Curve, AUC)}$$

where

$N$ = #observations in dataset
$n_c$ = #concordant pairs
$n_D$ = #discordant pairs
$n_T$ = #tied pairs
$n_p$ = total #pairs, i.e. $n_p = n_c + n_D + n_T$
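
As a quick check of the four formulas, the Python sketch below plugs in the pair counts from the illustration in this section (11 concordant, 3 discordant, 1 tied, 8 observations) and recovers the values reported in the association table. The function name `association_metrics` is a hypothetical helper.

```python
def association_metrics(n_c, n_d, n_t, n_obs):
    """Somers' D, Gamma, Tau-a and c from concordant/discordant/tied pair counts."""
    n_p = n_c + n_d + n_t                       # total (event, non-event) pairs
    return {
        "somers_d": (n_c - n_d) / n_p,
        "gamma": (n_c - n_d) / (n_c + n_d),     # ties are ignored in the denominator
        "tau_a": (n_c - n_d) / (0.5 * n_obs * (n_obs - 1)),
        "c": (n_c + 0.5 * n_t) / n_p,           # tied pairs count as half concordant
    }

metrics = association_metrics(n_c=11, n_d=3, n_t=1, n_obs=8)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

Note that Tau-a divides by all N(N - 1)/2 pairs of observations, including event-event and non-event-non-event pairs, which is why it is much smaller than Somers' D when events are rare.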

































2.1.3. Receiver Operating Characteristics (ROC)


ROC graph is a 2-dimensional graph in which

■ True positive rate is plotted on the Y-axis
■ False positive rate is plotted on the X-axis


True Positive Rate (or Sensitivity)

$$\text{True Positive Rate} = \frac{\#\text{Events correctly classified as Event}}{\#\text{Events}} = \frac{TP}{TP + FN}$$


False Positive Rate (or 1 - Specificity)

$$
\begin{aligned}
\text{False Positive Rate} &= \frac{\#\text{Non-Events wrongly classified as Event}}{\#\text{Non-Events}} = \frac{FP}{FP + TN} \\
&= 1 - \frac{TN}{FP + TN} \\
&= 1 - \frac{\#\text{Non-Events correctly classified as Non-Event}}{\#\text{Non-Events}} \\
&= 1 - \text{Specificity}
\end{aligned}
$$
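
Each cut-off probability yields one point on the ROC graph. As a small Python sketch (the helper name `roc_point` is hypothetical), the counts below are taken from the first row of the train OUTROC dataset shown later in this section, where 4 events are correctly predicted, 12 are missed, 2 non-events are predicted as events and 182 are correctly predicted.

```python
def roc_point(tp, fn, fp, tn):
    """Return (false positive rate, true positive rate) for one cut-off."""
    tpr = tp / (tp + fn)        # sensitivity
    fpr = fp / (fp + tn)        # 1 - specificity
    return fpr, tpr

fpr, tpr = roc_point(tp=4, fn=12, fp=2, tn=182)
specificity = 182 / (2 + 182)
```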
ROC Space

[Figure: ROC space with True Positive Rate (Sensitivity) on the Y-axis and False Positive Rate (1 - Specificity) on the X-axis, both from 0 to 1, showing the corner points (0, 0), (0, 1) and (1, 1) and the 45° random line]

Point (0, 1): Perfect Classification
■ FP Rate = 0, TP Rate = 1
■ => FP = FN = 0
■ => TN = #Non-Events and TP = #Events

Point (0, 0): FP Rate = TP Rate = 0
■ => FP = TP = 0
■ Strategy of never issuing a positive classification
■ No false positive errors
■ No gains of true positives

Point (1, 1): FP Rate = TP Rate = 1
■ => TN = FN = 0
■ Strategy of never issuing a negative classification
■ No false negative errors
■ No gains of true negatives

Classifiers towards the lower left of the ROC space make positive classifications only with strong evidence
■ They make few false positive errors
■ But they often have low true positive rates

Classifiers towards the upper right of the ROC space make positive classifications with weak evidence
■ They classify nearly all positives (events) correctly
■ But they often have high false positive rates

45° random line
■ Random guess
■ 50% of total area lies under random line, i.e. AUC = 0.5
SAS Implementation


PROC LOGISTIC DATA = <train dataset>          /* Specify name of modeling dataset for regression */
              NAMELEN = 32                    /* This option does not let variable name length get truncated to 20 */
              DESCENDING ;                    /* This option reverses the sorting order for the levels of dependent variable */
    MODEL <dependent> = <regressors>
          / SELECTION = <selection method>    /* Specify variable selection method */
            SLE = <SLE criterion>             /* Specify significance level of entry */
            SLS = <SLS criterion>             /* Specify significance level of stay */
            OUTROC = <train ROC dataset> ;    /* This option creates ROC output dataset for train data; to be used to plot ROC graph */
    OUTPUT OUT = <train predictions>          /* This option generates train scored dataset */
           P = P_1 ;                          /* This option requests score variable name. Specify P_1 to denote probability of event */
    SCORE DATA = <test dataset>               /* This option requests name of test dataset as input */
          OUT = <test predictions>            /* This option generates test scored dataset */
          OUTROC = <test ROC dataset> ;       /* This option creates ROC output dataset for test data; to be used to plot ROC graph */
RUN ;


PROC LOGISTIC DATA = <train predictions> DESCENDING ;
    MODEL <dependent> = ;                     /* Specify only dependent variable. Do not specify regressors */
    ROC PRED = P_1 ;                          /* Specify P_1 as score variable name */
    ROCCONTRAST ;                             /* This option compares Random AUC (0.5) with train AUC and checks significance */
RUN ;

PROC LOGISTIC DATA = <test predictions> DESCENDING ;
    /* MODEL, ROC and ROCCONTRAST statements, as for the train predictions */
RUN ;
Illustrative SAS Output (Output generated due to ROC PRED= and ROCCONTRAST options)

[Figure: SAS output for the ROC and ROCCONTRAST statements]


Illustrative SAS Output (Output generated due to OUTROC=<train ROC dataset> option)

Train ROC Dataset (train_outroc.sas7bdat)

Row   _STEP_   _PROB_     _POS_   _NEG_   _FALPOS_   _FALNEG_   _SENSIT_   _1MSPEC_
1     1        0.586335   4       182     2          12         0.25       0.01087
2     1        0.292755   7       176     8          9          0.4375     0.043478
3     1        0.107847   10      131     53         6          0.625      0.288043
4     1        0.034099   16      0       184        0          1          1


Variable Description

Variable     Meaning
_STEP_       Model Building Step
_PROB_       Cut-off Probability Level for Assigning Classes
_POS_        No. of Correctly Predicted Events
_NEG_        No. of Correctly Predicted Nonevents
_FALPOS_     No. of Nonevents Predicted as Events
_FALNEG_     No. of Events Predicted as Nonevents
_SENSIT_     Sensitivity
_1MSPEC_     1 - Specificity


[Figure: Train ROC Curve]
Illustrative SAS Output (Output generated due to OUTROC=<test ROC dataset> option)


Test ROC Dataset (test_outroc.sas7bdat)

Row   _PROB_     _POS_   _NEG_   _FALPOS_   _FALNEG_   _SENSIT_   _1MSPEC_
1     0.982732   0       157     1          14         0          0.006329
2     0.943246   0       156     2          14         0          0.012658
3     0.829164   0       155     3          14         0          0.018987
4     0.586335   0       154     4          14         0          0.025316
5     0.292755   2       147     11         12         0.142857   0.06962
6     0.107847   8       123     35         6          0.571429   0.221519
7     0.034099   14      0       158        0          1          1


Variable Description

Variable     Meaning
_PROB_       Cut-off Probability Level for Assigning Classes
_POS_        No. of Correctly Predicted Events
_NEG_        No. of Correctly Predicted Nonevents
_FALPOS_     No. of Nonevents Predicted as Events
_FALNEG_     No. of Events Predicted as Nonevents
_SENSIT_     Sensitivity
_1MSPEC_     1 - Specificity


[Figure: Test ROC Curve]















Illustration: Manual Computation of Area Under the Curve (AUC) from ROC Data Points

Each row adds the area of one trapezoid under the ROC curve: 0.5 x (sum of current and previous sensitivity) x (increase in 1 - specificity). The LAG_ columns hold the previous row's values, starting from 0.

Train AUC Calculation (data from TRAIN_OUTROC dataset)

(A)        (B)         (C)         (D)            (E)            (F)          (G)          (H)
_PROB_     _SENSIT_    _1MSPEC_    LAG_SENSIT_    LAG_1MSPEC_    (B) + (D)    (C) - (E)    0.5 x (F) x (G)
0.5863     0.2500      0.0109      0.0000         0.0000         0.2500       0.0109       0.0014
0.2928     0.4375      0.0435      0.2500         0.0109         0.6875       0.0326       0.0112
0.1078     0.6250      0.2880      0.4375         0.0435         1.0625       0.2446       0.1299
0.0341     1.0000      1.0000      0.6250         0.2880         1.6250       0.7120       0.5785

AUC = Σ(H) = 0.7210


Test AUC Calculation (data from TEST_OUTROC dataset)

(A)        (B)         (C)         (D)            (E)            (F)          (G)          (H)
_PROB_     _SENSIT_    _1MSPEC_    LAG_SENSIT_    LAG_1MSPEC_    (B) + (D)    (C) - (E)    0.5 x (F) x (G)
0.9827     0.0000      0.0063      0.0000         0.0000         0.0000       0.0063       0.0000
0.9432     0.0000      0.0127      0.0000         0.0063         0.0000       0.0063       0.0000
0.8292     0.0000      0.0190      0.0000         0.0127         0.0000       0.0063       0.0000
0.5863     0.0000      0.0253      0.0000         0.0190         0.0000       0.0063       0.0000
0.2928     0.1429      0.0696      0.0000         0.0253         0.1429       0.0443       0.0032
0.1078     0.5714      0.2215      0.1429         0.0696         0.7143       0.1519       0.0542
0.0341     1.0000      1.0000      0.5714         0.2215         1.5714       0.7785       0.6117

AUC = Σ(H) = 0.6691
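
The spreadsheet calculation above is the trapezoidal rule applied to the ROC points. A minimal Python sketch of the same computation follows; `auc_trapezoid` is a hypothetical helper, and the (1 - specificity, sensitivity) points are copied from the train and test OUTROC datasets shown earlier.

```python
def auc_trapezoid(points):
    """Area under an ROC curve by the trapezoidal rule.

    points: (false positive rate, true positive rate) pairs sorted by
    increasing false positive rate; the origin (0, 0) is added implicitly.
    """
    area = 0.0
    prev_fpr, prev_tpr = 0.0, 0.0                 # the LAG_ values start at zero
    for fpr, tpr in points:
        area += 0.5 * (tpr + prev_tpr) * (fpr - prev_fpr)
        prev_fpr, prev_tpr = fpr, tpr
    return area

train_points = [(0.01087, 0.25), (0.043478, 0.4375), (0.288043, 0.625), (1.0, 1.0)]
test_points = [
    (0.006329, 0.0), (0.012658, 0.0), (0.018987, 0.0), (0.025316, 0.0),
    (0.06962, 0.142857), (0.221519, 0.571429), (1.0, 1.0),
]

train_auc = auc_trapezoid(train_points)   # about 0.721
test_auc = auc_trapezoid(test_points)     # about 0.669
```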











Illustration: Train AUC from SAS Output

LST File (concordance.lst)

Association of Predicted Probabilities and Observed Responses

Percent Concordant    56.0    Somers' D    0.442
Percent Discordant    11.8    Gamma        0.651
Percent Tied          32.2    Tau-a        0.065
Pairs                 2944    c            0.721


Train AUC Calculation

Method 1: AUC = %Concordant + 0.5 (%Tied) = 56.0% + 0.5(32.2%) = 72.1% = 0.721

Method 2: AUC = c = 0.721
2.1.4 Gini Coefficient


■ Gini coefficient is a measure of degree of discrimination between goods (non-events) and bads (events)
■ Gini coefficient is twice the area between ROC curve and 45° random line of equality
■ Gini coefficient varies between 0 and 1
■ Gini = 0 implies no discrimination
■ Gini = 1 implies perfect discrimination


Relation between Gini and AUC

[Figure: ROC Curve from (0, 0) to (1, 1) plotted above the 45° random line, with True Positive Rate on the Y-axis and False Positive Rate (1 - Specificity) on the X-axis]
Relation between Gini and Concordance


Two important points:

■ Gini is simply the difference between concordance and discordance
■ Gini is equivalent to Somers’ D

$$
\begin{aligned}
\text{Gini} &= 2\,\text{AUC} - 1 \\
&= 2\left(\frac{n_c + 0.5\,n_T}{n_p}\right) - 1 \\
&= \frac{2 n_c + n_T - n_p}{n_p} \\
&= \frac{2 n_c + n_T - (n_c + n_D + n_T)}{n_p} \\
&= \frac{n_c - n_D}{n_p} \\
&= \text{Somers' } D
\end{aligned}
$$

Recall from Section 2.1.2

$$\text{Somers' } D = \frac{n_c - n_D}{n_p} \qquad c \ (\text{i.e. AUC}) = \frac{n_c + 0.5\,n_T}{n_p}$$

where

$n_c$ = #concordant pairs
$n_D$ = #discordant pairs
$n_T$ = #tied pairs
$n_p$ = total #pairs, i.e. $n_p = n_c + n_D + n_T$
Illustration: Gini from SAS Output

LST File (Illustration from Section 2.1.2): concordance_calculation.lst

Association of Predicted Probabilities and Observed Responses

Percent Concordant    73.3    Somers' D    0.533
Percent Discordant    20      Gamma        0.571
Percent Tied          6.7     Tau-a        0.286
Pairs                 15      c            0.767


Gini Calculation

Method 1: Gini = Concordance - Discordance = 73.3% - 20% = 53.3% = 0.533

Method 2: Gini = Somers’ D = 0.533

Method 3: Gini = 2AUC - 1 = 2(0.767) - 1 = 0.533
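
The three routes to Gini agree because they are algebraically the same quantity. The short Python sketch below checks this on the pair counts behind the SAS output above (11 concordant, 3 discordant, 1 tied); the function names are hypothetical.

```python
def gini_from_pairs(n_c, n_d, n_t):
    """Gini as concordance minus discordance, i.e. Somers' D."""
    return (n_c - n_d) / (n_c + n_d + n_t)

def gini_from_auc(auc):
    """Gini as twice the area between the ROC curve and the 45-degree line."""
    return 2 * auc - 1

n_c, n_d, n_t = 11, 3, 1
auc = (n_c + 0.5 * n_t) / (n_c + n_d + n_t)     # the c statistic

gini_pairs = gini_from_pairs(n_c, n_d, n_t)
gini_auc = gini_from_auc(auc)
```

Working from unrounded counts avoids the small discrepancy that appears when the rounded c = 0.767 is doubled by hand (2 x 0.767 - 1 = 0.534).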
2.1.5. Cumulative Lift Chart


Cumulative lift chart (also known as cumulative gains chart) is a widely used measure of model’s effectiveness in
capturing bads (events) by rank-ordering of population based on model’s score (predictions)

Lift is not a single value for overall model. It is calculated at bin level. The bins may be:

■ Deciles (i.e. 10 equal-sized bins); or

■ Demi-Deciles (i.e. 20 equal-sized bins); or

■ Percentiles (i.e. 100 equal-sized bins)

Lift is computed after rank-ordering of records based on model’s score. Scale of score does not matter
Model performance is generally assessed by examining cumulative lift at top 1, 2 or 3 deciles


Steps for Cumulative Lift Calculation

Step 1 Sort data by predicted value (i.e. model’s score) in descending order, given that focus class is TARGET = 1

Step 2 Divide data into 10, 20 or 100 equal-sized bins

Step 3 Summarize data at bin level and compute bin population, #events and #non-events for each bin

Step 4 For each bin, calculate bin lift as ratio of #events captured in the bin to total #events in the dataset

Step 5 Calculate cumulative lift as %cumulative events captured at bin level

Step 6 Plot cumulative lift chart with ‘%Cumulative Population’ on X-axis and ‘%Cumulative Events Captured’ on Y-axis
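
Steps 1 to 3 (sort, bin, summarize) can be sketched as follows. This is a Python illustration on ten hypothetical observations, not the SAS procedure behind the slides; `bin_summary` is an assumed helper, and its rank-based bin assignment may place boundary rows slightly differently from SAS when the row count is not a multiple of the number of bins.

```python
def bin_summary(targets, scores, n_bins=10):
    """Sort by score (descending), cut into n_bins by rank, summarize each bin."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    summary = [{"obs": 0, "events": 0, "non_events": 0} for _ in range(n_bins)]
    for rank, i in enumerate(order):
        b = rank * n_bins // n                 # bin 0 holds the highest scores
        summary[b]["obs"] += 1
        summary[b]["events"] += targets[i]
        summary[b]["non_events"] += 1 - targets[i]
    return summary

# Hypothetical data: 10 observations, 3 events, 5 bins of 2 observations each
targets = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.05, 0.90, 0.80, 0.60, 0.70, 0.50, 0.30, 0.20, 0.40, 0.10]
summary = bin_summary(targets, scores, n_bins=5)
events_per_bin = [row["events"] for row in summary]
```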
Illustration: Customer Attrition (Target Variable: IND_ATTR)


Train Dataset (train.sas7bdat)

Row      CUST_ID   IND_ATTR   PRED
1        X00001    0          0.0062
2        X00004    0          0.0084
<Rows Deleted>
4000     X08145    0          0.0235
4001     X08147    1          0.0643
<Rows Deleted>
19877    X40001    0          0.0463
19878    X40003    0          0.0044
19879    X40004    0          0.0810


Steps 1 and 2: data sorted by PRED in descending order (train_sort.sas7bdat) and divided into 10 bins (train_sort_bin.sas7bdat)

Row      CUST_ID   IND_ATTR   PRED     BIN
1        X14638    1          0.1663   1
2        X12184    0          0.1546   1
<Rows Deleted>
4000     X00696    0          0.0266   3
4001     X01066    1          0.0245   3
<Rows Deleted>
19877    X11431    0          0.0009   10
19878    X18221    0          0.0005   10
19879    X00940    0          0.0002   10


Steps 3 to 5: bin-level summary (train_bin_summary.sas7bdat) with bin lift and cumulative lift

(A)      (B)       (C)     (D)       (E)
BIN      OBS       BADS    GOODS     BIN LIFT = (C) ÷ Σ(C)    CUM LIFT
1        1,987     134     1,853     36.3%                    36.3%
2        1,988     71      1,917     19.2%                    55.6%
3        1,988     39      1,949     10.6%                    66.1%
4        1,988     41      1,947     11.1%                    77.2%
5        1,988     24      1,964     6.5%                     83.7%
6        1,988     19      1,969     5.1%                     88.9%
7        1,988     14      1,974     3.8%                     92.7%
8        1,988     14      1,974     3.8%                     96.5%
9        1,988     7       1,981     1.9%                     98.4%
10       1,988     6       1,982     1.6%                     100.0%
Total    19,879    369     19,510    100%
Illustration: Customer Attrition (Target Variable: IND_ATTR)


Continued . . .


Test Dataset (test.sas7bdat)

Row      CUST_ID   IND_ATTR   PRED
1        X00002    0          0.0281
2        X00003    0          0.0190
<Rows Deleted>
4000     X08123    1          0.1286
4001     X08124    0          0.0007
<Rows Deleted>
19901    X40011    0          0.0123
19902    X40015    0          0.0003
19903    X40027    0          0.0318


Steps 1 and 2: data sorted by PRED in descending order (test_sort.sas7bdat) and divided into 10 bins (test_sort_bin.sas7bdat)

Row      CUST_ID   IND_ATTR   PRED     BIN
1        X00920    0          0.1546   1
2        X11300    1          0.1319   1
<Rows Deleted>
4000     X00100    1          0.0262   3
4001     X35937    0          0.0239   3
<Rows Deleted>
19901    X15836    0          0.0008   10
19902    X00591    0          0.0004   10
19903    X00009    0          0.0002   10


Steps 3 to 5: bin-level summary with bin lift and cumulative lift

(A)      (B)       (C)     (D)       (E)
BIN      OBS       BADS    GOODS     BIN LIFT = (C) ÷ Σ(C)    CUM LIFT
1        1,990     126     1,864     32.1%                    32.1%
2        1,990     76      1,914     19.3%                    51.4%
3        1,990     53      1,937     13.5%                    64.9%
4        1,991     36      1,955     9.2%                     74.0%
5        1,990     34      1,956     8.7%                     82.7%
6        1,990     15      1,975     3.8%                     86.5%
7        1,991     22      1,969     5.6%                     92.1%
8        1,990     13      1,977     3.3%                     95.4%
9        1,990     9       1,981     2.3%                     97.7%
10       1,991     9       1,982     2.3%                     100.0%
Total    19,903    393     19,510    100%
































Illustration: Customer Attrition (Target Variable: IND_ATTR) . . . Continued

Ideal Lift: The model rank orders all events above all non-events, so the ideal cumulative lift reaches 100% once the cumulative population share equals the event rate

Random Lift: At X% of the population, random guessing captures X% of the events, so the random lift curve is the 45° line


A     B                        C                        D
Bin   %Cumulative Population   %Train Cumulative Lift   %Test Cumulative Lift
1     10%                      36.3%                    32.1%
2     20%                      55.6%                    51.4%
3     30%                      66.1%                    64.9%
4     40%                      77.2%                    74.0%
5     50%                      83.7%                    82.7%
6     60%                      88.9%                    86.5%
7     70%                      92.7%                    92.1%
8     80%                      96.5%                    95.4%
9     90%                      98.4%                    97.7%
10    100%                     100.0%                   100.0%



Interpretation (based on test dataset results) : Any incentive strategy devised for top 20% customers (3,980 out of 19,903 
customers) is expected to capture more than 50% attrition cases (202 out of 393 attrition cases) 
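
The lift columns above can be reproduced from the per-bin event counts alone. The following is a minimal Python sketch (the deck itself uses SAS and a spreadsheet); the function name `cumulative_lift` is illustrative, and the counts are the BADS column of the test dataset table.

```python
# Attrition events (BADS) per score bin, copied from the test dataset table.
test_bads = [126, 76, 53, 36, 34, 15, 22, 13, 9, 9]


def cumulative_lift(bads_per_bin):
    """Return (bin_lift, cum_lift), both as fractions of all events."""
    total = sum(bads_per_bin)
    bin_lift = [b / total for b in bads_per_bin]
    cum_lift = []
    running = 0
    for b in bads_per_bin:
        running += b
        cum_lift.append(running / total)
    return bin_lift, cum_lift


bin_lift, cum_lift = cumulative_lift(test_bads)
for bin_no, (bl, cl) in enumerate(zip(bin_lift, cum_lift), start=1):
    print(f"bin {bin_no:2d}  bin lift {bl:6.1%}  cum lift {cl:6.1%}")
```

The second entry of `cum_lift` is 202 / 393, which is the "more than 50% of attrition cases in the top 20% of customers" figure quoted in the interpretation.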
2.1.6. Kolmogorov-Smirnov (K-S) Statistic

Meaning

The K-S statistic is the maximum vertical difference between the cumulative lift curve for events (bads) and the cumulative lift curve for non-events (goods)

Word of Caution

K-S is based on a single point on the good and bad distributions: the point where the cumulative distributions differ the most. It should not be relied upon without carefully looking at the distributions

For a reasonable model, the K-S value (maximum difference) should be attained within the top few deciles

[Figure: two K-S charts, Model 1 and Model 2, each plotting %Events Captured and %Non-Events Captured against %Cumulative Population (0% to 100%). One chart is marked Acceptable Model (✓) and the other Unacceptable Model (✗); the charts carry the annotation "KS = 0.45 at 0.60".]





















Illustration: Customer Attrition (Target Variable: IND_ATTR) . . . Continued from Section 2.1.5

Train Dataset K-S Statistic

A       B        C      D        E            F          G            H          I
Bin     Cases    Bads   Goods    Bin Lift     Cum Lift   Bin Lift     Cum Lift   (F) − (H)
                                 for Bads     for Bads   for Goods    for Goods
                                 (C) ÷ Σ(C)              (D) ÷ Σ(D)
1       1,987    134    1,853    36.3%        36.3%      9.5%         9.5%       0.268
2       1,988    71     1,917    19.2%        55.6%      9.8%         19.3%      0.362
3       1,988    39     1,949    10.6%        66.1%      10.0%        29.3%      0.368
4       1,988    41     1,947    11.1%        77.2%      10.0%        39.3%      0.379
5       1,988    24     1,964    6.5%         83.7%      10.1%        49.4%      0.344
6       1,988    19     1,969    5.1%         88.9%      10.1%        59.5%      0.294
7       1,988    14     1,974    3.8%         92.7%      10.1%        69.6%      0.231
8       1,988    14     1,974    3.8%         96.5%      10.1%        79.7%      0.168
9       1,988    7      1,981    1.9%         98.4%      10.2%        89.8%      0.085
10      1,988    6      1,982    1.6%         100.0%     10.2%        100.0%     0.000
Total   19,879   369    19,510   100%                    100%

KS = 0.379 (maximum of column I, attained at Bin 4)

[Figure: cumulative lift curves for bads and goods against %Cumulative Population, with the K-S gap marked]
Illustration: Customer Attrition (Target Variable: IND_ATTR) . . . Continued from Section 2.1.5

Test Dataset K-S Statistic

A       B        C      D        E            F          G            H          I
Bin     Cases    Bads   Goods    Bin Lift     Cum Lift   Bin Lift     Cum Lift   (F) − (H)
                                 for Bads     for Bads   for Goods    for Goods
                                 (C) ÷ Σ(C)              (D) ÷ Σ(D)
1       1,990    126    1,864    32.1%        32.1%      9.6%         9.6%       0.225
2       1,990    76     1,914    19.3%        51.4%      9.8%         19.4%      0.320
3       1,990    53     1,937    13.5%        64.9%      9.9%         29.3%      0.356
4       1,991    36     1,955    9.2%         74.0%      10.0%        39.3%      0.347
5       1,990    34     1,956    8.7%         82.7%      10.0%        49.3%      0.334
6       1,990    15     1,975    3.8%         86.5%      10.1%        59.5%      0.271
7       1,991    22     1,969    5.6%         92.1%      10.1%        69.6%      0.226
8       1,990    13     1,977    3.3%         95.4%      10.1%        79.7%      0.157
9       1,990    9      1,981    2.3%         97.7%      10.2%        89.8%      0.079
10      1,991    9      1,982    2.3%         100.0%     10.2%        100.0%     0.000
Total   19,903   393    19,510   100%                    100%

KS = 0.356 (maximum of column I, attained at Bin 3)

[Figure: cumulative lift curves for bads and goods against %Cumulative Population, with the K-S gap marked]
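
The K-S column in the two illustration tables can be recomputed directly from the per-bin bads and goods. This is a minimal Python sketch rather than the course's SAS or spreadsheet workflow; `ks_statistic` is an illustrative name, and the counts are copied from the train and test tables.

```python
train_bads = [134, 71, 39, 41, 24, 19, 14, 14, 7, 6]
train_goods = [1853, 1917, 1949, 1947, 1964, 1969, 1974, 1974, 1981, 1982]
test_bads = [126, 76, 53, 36, 34, 15, 22, 13, 9, 9]
test_goods = [1864, 1914, 1937, 1955, 1956, 1975, 1969, 1977, 1981, 1982]


def ks_statistic(bads, goods):
    """Return (ks, bin_no): the largest gap between the cumulative share of
    bads and the cumulative share of goods, and the bin where it occurs.
    Bins must already be ordered from highest to lowest model score."""
    total_bads, total_goods = sum(bads), sum(goods)
    cum_bads = cum_goods = 0
    best_gap, best_bin = 0.0, 0
    for bin_no, (b, g) in enumerate(zip(bads, goods), start=1):
        cum_bads += b
        cum_goods += g
        gap = cum_bads / total_bads - cum_goods / total_goods
        if gap > best_gap:
            best_gap, best_bin = gap, bin_no
    return best_gap, best_bin


train_ks, train_bin = ks_statistic(train_bads, train_goods)
test_ks, test_bin = ks_statistic(test_bads, test_goods)
print(f"train KS = {train_ks:.3f} at bin {train_bin}")
print(f"test  KS = {test_ks:.3f} at bin {test_bin}")
```

Both maxima fall within the first four deciles, which is consistent with the guideline that a reasonable model attains its K-S value in the top few deciles.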
2.1.7. Decile-wise Event Rate Chart

In addition to the lift chart, a decile-wise event rate chart is plotted to gauge whether the event rate rank orders well

Moving down from Decile 1 to Decile 10, the average value of the target (i.e. the event rate) should ideally fall monotonically

In practice, however, a few reverse breaks may be observed. If such breaks are neither frequent nor significant, the model may still be accepted

Illustration: Customer Attrition (Target Variable: IND_ATTR) . . . Continued from Section 2.1.5

[Figure: Train Data Event Rate Chart and Test Data Event Rate Chart, each plotting event rate by Decile (Based on Model Score), Deciles 1 to 10]


In general, there is a declining trend in event rate as we move from Decile 1 to Decile 10 
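
A reverse break can be located numerically as well as visually. The sketch below, in Python, computes decile-wise event rates from the bin summaries of Section 2.1.5 and lists every decile whose event rate is higher than that of the decile before it; the helper names are illustrative and not part of the course material.

```python
train_obs = [1987] + [1988] * 9
train_bads = [134, 71, 39, 41, 24, 19, 14, 14, 7, 6]
test_obs = [1990, 1990, 1990, 1991, 1990, 1990, 1991, 1990, 1990, 1991]
test_bads = [126, 76, 53, 36, 34, 15, 22, 13, 9, 9]


def event_rates(bads, obs):
    """Event rate (bads / observations) for each decile."""
    return [b / n for b, n in zip(bads, obs)]


def reverse_breaks(rates):
    """Deciles (1-based) whose event rate exceeds the previous decile's."""
    return [i + 1 for i in range(1, len(rates)) if rates[i] > rates[i - 1]]


train_breaks = reverse_breaks(event_rates(train_bads, train_obs))
test_breaks = reverse_breaks(event_rates(test_bads, test_obs))
print("train reverse breaks at deciles:", train_breaks)
print("test reverse breaks at deciles:", test_breaks)
```

Whether a break is "significant" remains a judgment call; the sketch only flags where the monotonic decline is interrupted.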
2.1.8. Hosmer-Lemeshow Test

Usage

The Hosmer-Lemeshow test is a goodness-of-fit test for a binary target variable

Unlike many other goodness-of-fit measures, it does not gauge the model's discriminatory power; it judges how closely the observed and the predicted values match

Procedure

1. Observations are divided into 10 deciles based on estimated probabilities

2. For each decile, compute

   a. Number of observed events (i.e. number of observations with event flag = 1)

   b. Number of expected events (i.e. total number of observations in the decile multiplied by the average predicted probability)

3. Discrepancies between the observed and expected number of events in the deciles are summarized by the Pearson chi-square statistic, which is compared with a chi-square distribution with DF = 8 (#deciles - 2)

4. A small p-value (<0.05) suggests that the fitted model is not adequate

H-L Test Statistic

$$\chi^2_{HL} = \sum_{j=1}^{g} \frac{(O_j - N_j \bar{\pi}_j)^2}{N_j \bar{\pi}_j (1 - \bar{\pi}_j)}$$

where

$O_j$ = Observed number of events in group $j$
$N_j$ = Total number of observations in group $j$
$\bar{\pi}_j$ = Average predicted probability in group $j$
$g$ = Number of groups ($g$ = 10 in case of deciles)
SAS Implementation

Below is the syntax for generating the Hosmer-Lemeshow test statistic

PROC LOGISTIC DATA = <modeling dataset>
              NAMELEN = 32
              DESCENDING ;
    MODEL <dependent> = <regressors>
          / SELECTION = <selection method>
            SLE = <SLE criterion>
            SLS = <SLS criterion>
            LACKFIT ;
RUN ;

DATA =       : Specify name of modeling dataset for regression
NAMELEN = 32 : This option prevents variable names from being truncated to 20 characters
DESCENDING   : This option reverses the sorting order for the levels of the dependent variable
LACKFIT      : This option requests the Hosmer-Lemeshow goodness-of-fit test
Illustration: Hosmer-Lemeshow Test (SAS Output)

LST File: hltest.lst

Partition for the Hosmer and Lemeshow Test

                      Target = 1              Target = 0
Group   Total   Observed   Expected     Observed   Expected
1       45      3          2.22         42         42.78
2       45      4          4.70         41         40.30
3       45      9          8.72         36         36.28
4       45      11         12.70        34         32.30
5       45      18         18.88        27         26.12
6       45      24         25.06        21         19.94
7       45      29         28.94        16         16.06
8       45      39         33.91        6          11.09
9       45      41         40.76        4          4.24
10      41      38         40.11        3          0.89

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square   DF   Pr > ChiSq
9.1720       8    0.3280
The p-value is well above 0.05, so the expected frequencies are not significantly different from the observed frequencies, indicating good model fit
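
The reported chi-square and p-value can be approximately reproduced from the partition table. The following Python sketch is an illustration only: it takes the rounded Expected values printed by SAS as inputs, so its statistic comes out near 9.13 rather than exactly 9.1720, and the p-value helper uses a closed form that holds only for an even number of degrees of freedom (DF = 8 here). Both function names are assumptions of this sketch.

```python
import math

# (total, observed events, expected events) per group, from the
# "Partition for the Hosmer and Lemeshow Test" table (Target = 1 columns).
groups = [
    (45, 3, 2.22), (45, 4, 4.70), (45, 9, 8.72), (45, 11, 12.70),
    (45, 18, 18.88), (45, 24, 25.06), (45, 29, 28.94), (45, 39, 33.91),
    (45, 41, 40.76), (41, 38, 40.11),
]


def hosmer_lemeshow(groups):
    """Sum of (O - E)^2 / (E * (1 - E/N)) over groups, where E = N * pi_bar."""
    stat = 0.0
    for n, observed, expected in groups:
        pi_bar = expected / n  # average predicted probability in the group
        stat += (observed - expected) ** 2 / (expected * (1 - pi_bar))
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


hl_stat = hosmer_lemeshow(groups)
df = len(groups) - 2
print(f"H-L statistic from rounded table values: {hl_stat:.3f}")
print(f"p-value for the SAS statistic 9.1720: {chi2_sf_even_df(9.1720, df):.4f}")
```

Groups 8 and 10 contribute most of the statistic, which is a useful reminder that the test is sensitive to sparse cells (group 10 has only 0.89 expected non-events).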




Exercise

Exercise 1. Default Payment Probability Prediction Model

A magazine publication company wants to identify the customers who are likely to default on their subscription payments.

Server     : 172.16.70.31
Location   : T:\IND004\sas training\methodology\module_5
Train Data : train_sample_1 (Number of Observations: 60,733)
Test Data  : test_sample_1 (Number of Observations: 60,188)

#    Variable                   Type   Label
1    CUST_ID                    Num    Customer identification number
2    IND_PAY_DEFAULT            Num    Takes value 1 if customer did not pay dues on time
3    IND_ADDRESS_CHANGED        Num    Takes value 1 if customer changed residential address in past one year
4    IND_CR_STAT_UNPAIDEVER     Num    Takes value 1 if customer credit status has ever been tagged as unpaid
5    ORDERCNT                   Num    Number of orders placed by customer during his tenure
6    MTHS_TO_ORDER_EXPIRATION   Num    Number of months left in expiration of current order
7    PROP_DIRECT_ORDER          Num    Ratio of number of orders placed by customer via direct channel to total number of orders
8    VARIETYRATIO               Num    Ratio of number of distinct products used by customer to total number of orders
9    INDSOUTHREGION             Num    Takes value 1 if customer belongs to south region
10   IND_PROM_MAIL_SENT         Num    Takes value 1 if any promotional mail was sent to the customer in past 1 month
11   CUST_TENURE                Num    Customer tenure in months
12   INDEASTREGION              Num    Takes value 1 if customer belongs to east region

Build a logistic regression model

(target variable: IND_PAY_DEFAULT, SLE = SLS = 0.05, selection method: BACKWARD)
Exercise 1. Default Payment Probability Prediction Model . . . Continued

For the developed model, 

a. Generate classification table and analyze it to find probability cut-off 

b. Report percent concordance and percent discordance for train dataset 

c. Calculate AUC and Gini for train and test datasets 

d. Calculate Hosmer-Lemeshow statistic for train dataset 

e. Plot cumulative lift chart for train and test datasets 

f. Compute K-S Statistic for train and test datasets 

g. Plot decile-wise default rate for train and test datasets 
2.2 Linear Regression Performance Measures

2.2.1 R² (Coefficient of Determination)

k-Variable Linear Regression Equation

Observed: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$

Model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k$

R² Interpretation

Proportion of variation in the target variable ($Y$) explained by the model ($\hat{Y}$)

R² is a goodness-of-fit measure, also known as the coefficient of determination

R² Definition 1

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

where

$ESS = \sum_i (\hat{Y}_i - \bar{Y})^2$ = Explained Sum of Squares (also known as Regression Sum of Squares)

$RSS = \sum_i (Y_i - \hat{Y}_i)^2$ = Residual Sum of Squares

$TSS = \sum_i (Y_i - \bar{Y})^2$ = Total Sum of Squares = ESS + RSS

R² Definition 2

$$R^2 = \left[\mathrm{correlation}(Y, \hat{Y})\right]^2$$

Things to Remember

$0 \le R^2 \le 1$
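
To see that the two definitions agree, the Python sketch below fits a one-variable least-squares line with an intercept and computes R² three ways: ESS/TSS, 1 − RSS/TSS, and the squared correlation between observed and fitted values. The data points are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean

# Hypothetical sample: one predictor x and a target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

# Ordinary least squares fit with an intercept.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

ess = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual
tss = sum((yi - y_bar) ** 2 for yi in y)                # total

r2_ess = ess / tss
r2_rss = 1 - rss / tss

# Definition 2: squared correlation between observed and fitted values.
y_hat_bar = mean(y_hat)
cov_y_yhat = sum((yi - y_bar) * (yh - y_hat_bar) for yi, yh in zip(y, y_hat))
corr = cov_y_yhat / math.sqrt(tss * sum((yh - y_hat_bar) ** 2 for yh in y_hat))
r2_corr = corr ** 2

print(f"ESS/TSS = {r2_ess:.6f}, 1 - RSS/TSS = {r2_rss:.6f}, corr^2 = {r2_corr:.6f}")
```

The three values coincide because the fit is by least squares with an intercept; without an intercept the identity TSS = ESS + RSS generally no longer holds in this form.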
2.2.2. Adjusted R²

Adjusted R² is a modification of R² that adjusts for the number of explanatory terms in the model

Unlike R², adjusted R² increases only if the new term improves the model more than expected by chance
Adjusted R² can be negative
Adjusted R² ≤ R²

$$\text{Adj. } R^2 = 1 - \frac{(1 - R^2)(n - m)}{n - (k + m)}$$

where

$R^2$ = Unadjusted R-Square
$n$ = Number of observations in the sample
$k$ = Number of explanatory variables
$m$ = 1 if model has an intercept term; otherwise $m$ = 0

Higher R² and Adjusted R² values indicate better model performance
2.2.3. Root Mean Squared Error (RMSE)

Meaning and Usage

Estimate of standard deviation of the error term
Calculated as square root of Mean Squared Error (MSE)

Scale dependent metric which does not have standalone meaning
Used for comparison across models for model selection

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n}}$$

where

$Y_i$ = Observed value
$\hat{Y}_i$ = Predicted value
$n$ = Number of observations

Things to Remember

Similar to RMSE, there are a few more metrics that can be used to compare models

1. Mean Error (ME)

2. Mean Squared Error (MSE)

3. Mean Absolute Error (MAE)

4. Mean Percentage Error (MPE)

5. Mean Absolute Percentage Error (MAPE)

Lower RMSE value indicates better model performance
2.2.4 Coefficient of Variation (COV)

Meaning and Usage

COV is calculated as the ratio of RMSE to the dependent variable mean, multiplied by 100
Unlike RMSE, it is a unit-less expression of variation in data

$$COV = \frac{RMSE}{\bar{Y}} \times 100\%$$

where

$RMSE$ = Root Mean Squared Error
$\bar{Y}$ = Average value of dependent variable

Lower COV value indicates better model performance
2.2.5. Primary and Secondary Diagonal

Procedure

Step 1  : Create bands based on actual (i.e. observed) and predicted values
Step 2  : Cross tabulate actual and predicted value bands and examine frequency distribution
Step 3a : Sum up percentages in primary diagonal cells to report primary diagonal metric
Step 3b : Sum up percentages in secondary diagonal cells to report secondary diagonal metric

Things to Remember

Banding is subjective. Do not manipulate bands to generate over-optimistic results

Illustration: Credit Card Payment Due Amount (Target Variable: DUE_AMT)

Primary Diagonal Metric   : 31.7%
Secondary Diagonal Metric : 28.4%

                                            Predicted Value Bands
Actual Value Bands   1. <1K   2. 1K-10K   3. 10K-25K   4. 25K-50K   5. 50K-75K   6. 75K-100K   7. 100K+   Total
1. <1K               5.1%     1.4%        0.7%         0.3%         0.1%         4.4%          2.3%       14.3%
2. 1K-10K            3.6%     7.1%        3.8%         4.5%         3.3%         0.8%          0.1%       23.2%
3. 10K-25K           0.0%     1.4%        2.0%         1.5%         0.3%         0.0%          1.4%       6.8%
4. 25K-50K           3.0%     3.1%        3.2%         4.6%         3.5%         0.3%          1.2%       18.7%
5. 50K-75K           1.0%     0.7%        0.4%         1.4%         3.1%         1.7%          1.0%       9.2%
6. 75K-100K          1.4%     0.7%        1.0%         0.0%         1.4%         3.7%          1.9%       10.2%
7. 100K+             0.3%     0.0%        1.4%         3.0%         3.1%         3.5%          6.1%       17.5%
Total                14.4%    14.5%       12.5%        15.4%        14.7%        14.5%         14.1%      100.0%
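
The two metrics can be recomputed from the cross-tabulation. In the Python sketch below, the primary diagonal is the set of cells where actual and predicted bands match; the secondary diagonal is assumed to be the cells one band above or below the primary diagonal. That assumption is not stated in the slide, but it gives 28.3% from the rounded cell values, close to the reported 28.4%.

```python
# Cell percentages from the illustration: rows are actual value bands,
# columns are predicted value bands, both ordered from <1K to 100K+.
crosstab = [
    [5.1, 1.4, 0.7, 0.3, 0.1, 4.4, 2.3],
    [3.6, 7.1, 3.8, 4.5, 3.3, 0.8, 0.1],
    [0.0, 1.4, 2.0, 1.5, 0.3, 0.0, 1.4],
    [3.0, 3.1, 3.2, 4.6, 3.5, 0.3, 1.2],
    [1.0, 0.7, 0.4, 1.4, 3.1, 1.7, 1.0],
    [1.4, 0.7, 1.0, 0.0, 1.4, 3.7, 1.9],
    [0.3, 0.0, 1.4, 3.0, 3.1, 3.5, 6.1],
]


def diagonal_sum(table, offset):
    """Sum of cells whose row and column band differ by exactly `offset`."""
    size = len(table)
    return sum(table[i][j]
               for i in range(size) for j in range(size)
               if abs(i - j) == offset)


primary = diagonal_sum(crosstab, 0)     # exact band match
secondary = diagonal_sum(crosstab, 1)   # off by one band (assumed definition)
print(f"primary diagonal   = {primary:.1f}%")
print(f"secondary diagonal = {secondary:.1f}%")
```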


Higher primary and secondary diagonal values indicate better model performance 
2.2.6. SAS Implementation

SAS Syntax

PROC REG DATA = <modeling dataset> ;
    MODEL <dependent> = <regressors>
          / SELECTION = <selection method>
            SLE = <SLE criterion>
            SLS = <SLS criterion> ;
QUIT ;

DATA =        : Specify name of modeling dataset for regression
SELECTION =   : Specify variable selection method
SLE = , SLS = : Specify significance level of entry and stay

Illustration

LST File: linear_regression.lst

The REG Procedure

Root MSE         2118.81970    R-Square   0.6353
Dependent Mean   7219.33125    Adj R-Sq   0.6339
Coeff Var          29.34925
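
The formulas of Sections 2.2.2 to 2.2.4 translate into a few lines of Python. The sketch below uses illustrative function names and checks the coefficient of variation against the Root MSE and Dependent Mean printed in the REG output. One caveat: the RMSE formula in Section 2.2.3 divides by n, while the Root MSE that PROC REG prints divides the residual sum of squares by the error degrees of freedom, so the two differ slightly on small samples.

```python
import math


def adjusted_r2(r2, n, k, m=1):
    """Adjusted R-square; m = 1 if the model has an intercept, else 0."""
    return 1 - (1 - r2) * (n - m) / (n - (k + m))


def rmse(observed, predicted):
    """Root mean squared error with n in the denominator (Section 2.2.3)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def cov_percent(rmse_value, dependent_mean):
    """Coefficient of variation: RMSE as a percentage of the target mean."""
    return 100 * rmse_value / dependent_mean


# Values taken from the REG output in the illustration.
coeff_var = cov_percent(2118.81970, 7219.33125)
print(f"Coeff Var = {coeff_var:.5f}")

# Small made-up inputs, only to exercise the other two helpers.
print(f"Adj R-Sq for R2=0.5, n=11, k=2: {adjusted_r2(0.5, 11, 2):.3f}")
print(f"RMSE of a toy example: {rmse([1, 2, 3], [1, 2, 5]):.4f}")
```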
2.2.7 Residual Analysis

Need for Residual Analysis

Objective 1: To check whether the residuals are patternless (randomly scattered) and centered around zero
Method of Analysis: Residual Plot

Objective 2: To check whether the residuals follow a normal distribution
Method of Analysis: Normal Q-Q Plot

Residual Plot

A graph that shows the residuals on the vertical axis and the fitted values on the horizontal axis

If the points in a residual plot are randomly dispersed around zero (the horizontal axis), a linear regression model is appropriate for the data; otherwise a non-linear model is more appropriate

Examples:

■ Random scatter around zero: linear regression is appropriate

■ Distinct curved pattern (U-shaped): linear model is not appropriate (bad fit); a non-linear model should be tried out

■ Funnel-shaped pattern: more spread for larger fitted values; check for heteroscedasticity

Normal Q-Q Plot

A Quantile-Quantile (Q-Q) plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other

A normal Q-Q plot shows the observed quantiles of the residuals on the vertical axis and the theoretical quantiles of the standard normal distribution on the horizontal axis

If the residuals follow a normal distribution, the points on the normal Q-Q plot should fall along a straight line

[Figure: example normal Q-Q plot]
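
The coordinates of a normal Q-Q plot can be generated without a plotting library. This Python sketch pairs the sorted residuals with standard normal quantiles; the residual values are hypothetical, and the plotting position (i − 0.5)/n is one common convention among several, chosen here as an assumption.

```python
from statistics import NormalDist

# Hypothetical residuals from a fitted linear model.
residuals = [-1.8, 0.4, 1.1, -0.3, 0.9, -0.7, 2.0, -1.2, 0.1, -0.5]


def normal_qq_points(values):
    """Return (theoretical quantile, observed quantile) pairs for a Q-Q plot."""
    n = len(values)
    observed = sorted(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


qq_points = normal_qq_points(residuals)
for theoretical, observed in qq_points:
    print(f"{theoretical:7.3f}  {observed:6.2f}")
```

Plotting the pairs as a scatter and checking for departures from a straight line gives the visual test described above.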
Exercise

Exercise 2. Spend Prediction Model

A hospital management wants to have an estimate of monthly spend (revenue) from each existing patient.

Server     : 172.16.70.31
Location   : T:\IND004\sas training\methodology\module_5
Train Data : train_sample_2 (Number of Observations: 3,500)
Test Data  : test_sample_2 (Number of Observations: 1,500)

#   Variable         Type   Label
1   PATIENT_ID       Num    Patient identification number
2   SPEND            Num    Monthly spend by the patient
3   VISITS3M         Num    Number of times patient visited hospital in last 3 months
4   INDSPCLSURGERY   Num    Takes value 1 if patient consulted a doctor with specialty in surgery
5   SEVERITY         Num    Severity index of disease (higher value indicates more severe disease)
6   AGE              Num    Age of the patient

Build a linear regression model

(target variable: SPEND, SLE = SLS = 0.05, selection method: BACKWARD)
Exercise 2. Spend Prediction Model . . . Continued

For the developed model, compute the following for the train and test datasets

a. R²

b. Adjusted R²

c. RMSE

d. Coefficient of Variation

e. Primary and Secondary Diagonal Metrics
Chapter 3: Model Stabilization

3.1 Population Stability Analysis

3.1.1 Population Stability Index (PSI)

Meaning and Usage

Widely used stability metric

Measures the shift in population from development sample to validation sample

Formula

$$PSI = \sum \left[ (\%Validation - \%Development) \times \ln\left(\frac{\%Validation}{\%Development}\right) \right]$$

[Figure: Frequency Distribution]

Guidelines for Assessment

PSI           Interpretation
< 0.10        Populations are similar
0.10 - 0.25   Some concern over stability
> 0.25        Substantial change in populations

Note: For a continuous variable, the bins are typically created by decile or demi-decile using the development sample
3.1.2. PSI Applications

Score Stability Analysis

PSI metric is calculated based on binning of the model score (predicted outcome)

Objective is to ascertain whether the score distribution has shifted and in what direction

Illustration: Credit Risk Score

A            B        C         D         E         F
Risk Score   DEV      VAL       %DEV      %VAL      PSI
<400         2,000    10,500    10.00%    10.50%    0.0002
401-500      2,000    9,300     10.00%    9.30%     0.0005
501-600      2,000    10,700    10.00%    10.70%    0.0005
601-700      2,000    9,500     10.00%    9.50%     0.0003
701-800      2,000    10,400    10.00%    10.40%    0.0002
801-900      2,000    10,500    10.00%    10.50%    0.0002
901-1000     2,000    9,100     10.00%    9.10%     0.0008
1001-1100    2,000    9,300     10.00%    9.30%     0.0005
1101-1200    2,000    11,000    10.00%    11.00%    0.0010
1200+        2,000    9,700     10.00%    9.70%     0.0001
Total        20,000   100,000   100.00%   100.00%   0.0043


58 | June 30, 2015 | © 2015 ExlService Holdings, Inc.



Characteristic Stability Analysis

PSI metric is calculated based on binning of a characteristic (i.e. explanatory variable)

Objective is to examine shifts in the distributions of individual characteristics and to understand whether high PSI values of a set of characteristics could explain a high PSI value of the overall score

Illustration: Demographic Characteristic (AGE)

A       B        C         D         E         F
AGE     DEV      VAL       %DEV      %VAL      PSI
<20     2,000    9,000     10.00%    9.00%     0.0011
21-25   2,000    9,000     10.00%    9.00%     0.0011
26-30   2,000    11,000    10.00%    11.00%    0.0010
31-35   2,000    9,000     10.00%    9.00%     0.0011
36-40   2,000    12,000    10.00%    12.00%    0.0036
41-45   2,000    7,000     10.00%    7.00%     0.0107
46-50   2,000    11,000    10.00%    11.00%    0.0010
51-55   2,000    11,000    10.00%    11.00%    0.0010
56-60   2,000    7,000     10.00%    7.00%     0.0107
60+     2,000    14,000    10.00%    14.00%    0.0135
Total   20,000   100,000   100.00%   100.00%   0.0445
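
Both illustration totals can be reproduced from the DEV and VAL counts. The Python sketch below is a minimal implementation of the PSI formula of Section 3.1.1; the function name `psi` is illustrative, and the sketch assumes every bin has at least one observation in both samples (an empty bin would make the logarithm undefined and needs separate handling).

```python
import math


def psi(dev_counts, val_counts):
    """Population Stability Index over matching bins of two samples."""
    dev_total, val_total = sum(dev_counts), sum(val_counts)
    total = 0.0
    for dev, val in zip(dev_counts, val_counts):
        pct_dev = dev / dev_total
        pct_val = val / val_total
        total += (pct_val - pct_dev) * math.log(pct_val / pct_dev)
    return total


dev = [2000] * 10  # development counts, identical in both illustrations
score_val = [10500, 9300, 10700, 9500, 10400, 10500, 9100, 9300, 11000, 9700]
age_val = [9000, 9000, 11000, 9000, 12000, 7000, 11000, 11000, 7000, 14000]

score_psi = psi(dev, score_val)
age_psi = psi(dev, age_val)
print(f"Risk score PSI = {score_psi:.4f}")
print(f"AGE PSI        = {age_psi:.4f}")
```

Both values are below 0.10, so under the assessment guidelines the development and validation populations would be read as similar.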














3.2 Model Stability Boosting Techniques

3.2.1. k-Fold Cross Validation

Purpose

Cross-validation (CV) is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available

Cross-validation provides a reasonable estimate of model fit. Using CV at the time of model development provides a realistic estimate of benchmark performance and thus infuses stability

Steps

1. Randomly divide data into k folds of equal size

2. Use k-1 folds of data for training, and one fold for testing

3. Repeat k times until all folds are used for testing

Things to Remember

Advantage: All observations are used for both training and validation, and each observation is used for validation exactly once

Illustration

In 5-fold cross-validation, the data would be split into five equal sets A, B, C, D and E. Models would be developed on each four-fifths of the data using the remaining one-fifth for testing as follows:

Iteration   TRAIN   TEST
1           ABCD    E
2           ABCE    D
3           ACDE    B
4           BCDE    A
5           ABDE    C

The results on the 5 test sets A, B, C, D and E are averaged to get the final estimate of model performance
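
The fold bookkeeping in the steps above can be sketched in Python as follows. The function `k_fold_splits` and its `seed` parameter are illustrative; model fitting and scoring are left out, since they depend on the modeling technique.

```python
import random


def k_fold_splits(n_obs, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)           # step 1: random division
    folds = [indices[i::k] for i in range(k)]      # k folds of near-equal size
    for test_fold in range(k):                     # step 3: rotate the test fold
        test_idx = folds[test_fold]
        train_idx = [idx
                     for fold_no, fold in enumerate(folds)
                     if fold_no != test_fold
                     for idx in fold]              # step 2: the other k-1 folds
        yield train_idx, test_idx


splits = list(k_fold_splits(n_obs=20, k=5))
for iteration, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"iteration {iteration}: train={len(train_idx)} test={sorted(test_idx)}")
```

A performance metric computed on each test fold would then be averaged across the k iterations to give the final estimate.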
3.2.2. Bootstrapping

Purpose

Bootstrapping is a very effective technique to identify stable variables for model development

It is a time-consuming process and hence is generally applied once a list of potential predictors (not more than 100) has already been identified. The idea is to pick the most stable ones out of the good performers.

Steps

1. Draw m samples (e.g. m = 1000) with 80% obs. selected randomly (with replacement) from train data

2. Build a model on each sample using a list of predictors and a model selection method (e.g. backward)

3. For each variable, compute 'percent occurrence' over all models

4. Apply a cut-off (e.g. 85%) on 'percent occurrence' to identify stable variables

Illustration: Telecom Churn (Target Variable: IND_CHURN)

A                 B         C       D
Variable          #Models   #Runs   Percent Occurrence
LIFE_ON_FILE      1,000     1,000   100.0%               Stable Predictors
DEVICE_QTY        1,000     1,000   100.0%
ACCT_SIZE         956       1,000   95.6%
TOT_MRC_AMT       882       1,000   88.2%
IND_BASIC_PHONE   875       1,000   87.5%
----------------------------------------------------     85% cut-off
OVERAGEAMT        610       1,000   61.0%                Unstable Variables
POP_PER_SQ_MILE   481       1,000   48.1%
SOUTH_REGION      350       1,000   35.0%

Things to Remember

Bootstrapping is also used as a variable reduction technique along with stabilization
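
Steps 1, 3 and 4 can be sketched in Python as below. The model-building step is not reproduced: the `model_counts` values are copied from the Telecom Churn illustration, and `bootstrap_sample_indices` is an illustrative helper showing one way to draw an 80% sample with replacement.

```python
import random


def bootstrap_sample_indices(n_obs, fraction=0.8, rng=None):
    """Step 1: draw fraction * n_obs row indices at random, with replacement."""
    rng = rng or random.Random()
    return rng.choices(range(n_obs), k=int(fraction * n_obs))


# Number of models (out of `runs`) in which each variable was selected,
# copied from the illustration table.
runs = 1000
model_counts = {
    "LIFE_ON_FILE": 1000, "DEVICE_QTY": 1000, "ACCT_SIZE": 956,
    "TOT_MRC_AMT": 882, "IND_BASIC_PHONE": 875, "OVERAGEAMT": 610,
    "POP_PER_SQ_MILE": 481, "SOUTH_REGION": 350,
}

cutoff = 0.85
percent_occurrence = {var: count / runs for var, count in model_counts.items()}  # step 3
stable = [var for var, pct in percent_occurrence.items() if pct >= cutoff]       # step 4
unstable = [var for var, pct in percent_occurrence.items() if pct < cutoff]

sample = bootstrap_sample_indices(100, rng=random.Random(1))
print("size of one bootstrap sample from 100 rows:", len(sample))
print("stable:", stable)
print("unstable:", unstable)
```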








3.2.3. Coefficient Blasting

Purpose

Eliminate variables with inconsistent estimates; or
Replace beta coefficients of original model with average beta values across samples

Things to Remember

Coefficient blasting may also be used as an ensemble technique along with stabilization

Steps

1. Draw m samples (e.g. m = 1000) with 80% obs. selected randomly (with replacement) from train data

2. Build a model on each sample using a 'fixed' list of predictors without any model selection method

3. For each variable, analyze the distribution of coefficients

Illustration: Membership Cancellation (Target Variable: IND_CANCEL)

[Figure: Coefficient Profile for Number of Complaints: distribution of the coefficient across samples, with the Coefficient axis running from 0.0209 to 0.0896 and the Original Non-Standardized Model Coefficient (0.0626) marked]

Sigma    Mean     Median
0.0111   0.0627   0.0630

Estimation of model coefficients over 1000 samples shows that the coefficients of predictors are stable and peak around the value identified in the original model
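
The three steps can be sketched end to end on synthetic data. To stay within the Python standard library, this sketch swaps the logistic model of the illustration for a one-predictor least-squares line; the data, the sample count m = 200 and all names are assumptions made for illustration only.

```python
import random
from statistics import mean, median, stdev

rng = random.Random(42)

# Synthetic training data: y = 2 + 0.5 * x + noise (purely illustrative).
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs for a model with an intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((a - x_bar) ** 2 for a in xs)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return sxy / sxx


def blast_coefficients(xs, ys, m=200, fraction=0.8):
    """Refit the fixed model on m resamples and collect the slope estimates."""
    n = len(xs)
    size = int(fraction * n)
    slopes = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(size)]   # step 1: with replacement
        slopes.append(ols_slope([xs[i] for i in idx],
                                [ys[i] for i in idx]))  # step 2: fixed predictor list
    return slopes


original = ols_slope(x, y)
slopes = blast_coefficients(x, y)
summary = {"original": original, "mean": mean(slopes),
           "median": median(slopes), "sigma": stdev(slopes)}  # step 3
print({name: round(value, 4) for name, value in summary.items()})
```

A coefficient whose resampled estimates change sign or spread widely would be a candidate for elimination; otherwise the mean across samples can stand in for the original estimate.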
3.2.4 Sensitivity Analysis

Purpose

Sensitivity analysis is carried out to gauge the sensitivity of model performance to variation (+/-1% and +/-5%) in a particular variable

Steps

1. Save original model equation and the predicted score

2. Vary a particular predictor by +1% (keeping all other predictors fixed) and regenerate score

3. Repeat step 2 using different percentages (-1%, +5% and -5%)

4. Plot original score against new scores generated by variations in a particular predictor and analyze

5. Repeat steps 2, 3 and 4 for all predictors one by one

Illustration: Membership Cancellation (Target Variable: IND_CANCEL)

[Figure: Sensitivity of Score to Variation in Number of Complaints: new scores under variations of -5%, -1%, +1% and +5% plotted against Original Score]

The graph shows that the model is not over-sensitive to slight changes in the predictor (number of complaints)
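
Steps 1 to 3 can be sketched for a single observation as follows. This Python example assumes a logistic score. The coefficient 0.0626 for number of complaints is the original model coefficient quoted in Section 3.2.3. The intercept, the second predictor and the input row are hypothetical values added only to make the example runnable.

```python
import math

# 0.0626 is quoted in the deck; the other two values are hypothetical.
coefficients = {"intercept": -2.0, "num_complaints": 0.0626, "tenure_months": -0.01}
row = {"num_complaints": 4, "tenure_months": 24}   # hypothetical customer


def score(row, coefs):
    """Logistic score for one observation."""
    z = coefs["intercept"] + sum(coefs[name] * value for name, value in row.items())
    return 1 / (1 + math.exp(-z))


def sensitivity(row, coefs, predictor, pct_changes=(-0.05, -0.01, 0.01, 0.05)):
    """Change in score when one predictor is varied and the rest stay fixed."""
    base = score(row, coefs)                       # step 1: original score
    deltas = {}
    for pct in pct_changes:                        # steps 2 and 3
        shifted = dict(row)
        shifted[predictor] = row[predictor] * (1 + pct)
        deltas[pct] = score(shifted, coefs) - base
    return base, deltas


base, deltas = sensitivity(row, coefficients, "num_complaints")
print(f"original score = {base:.5f}")
for pct, delta in deltas.items():
    print(f"variation {pct:+.0%}: score change {delta:+.6f}")
```

Repeating the call for each predictor and plotting the new scores against the original score covers steps 4 and 5.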
Thanks

For queries, contact Varun Aggarwal at Varun.Aggarwal@exlservice.com